Author: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Date: October 22, 2020 (Last revised: June 3, 2021)
Link: https://arxiv.org/abs/2010.11929
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Introduction
The Vision Transformer (ViT) revolutionized computer vision by demonstrating that pure transformer architectures—without any convolutional components—can achieve state-of-the-art performance on image classification tasks. This breakthrough challenged the long-held assumption that convolutional neural networks (CNNs) are essential for computer vision.
The CNN Dominance Era
Before ViT, computer vision was dominated by convolutional neural networks:
- CNNs: Designed specifically for image data with inductive biases
- Attention in vision: Limited to augmenting CNNs or replacing specific components
- Architectural assumption: Convolutions were considered necessary for visual tasks
The Transformer architecture, despite revolutionizing NLP, had limited success in pure vision applications.
ViT's Revolutionary Insight
The key insight: CNNs are not necessary for excellent image recognition performance.
Instead of adapting Transformers to work with CNNs, ViT applies a pure Transformer directly to sequences of image patches, treating images similarly to how NLP models treat text sequences.
Architecture Overview
Image Patch Embedding
The core innovation is treating an image as a sequence of patches:
class PatchEmbedding(nn.Module):
def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
"""
Convert image to sequence of patch embeddings
Args:
image_size: Input image size (assumes square images)
patch_size: Size of each patch (16x16 pixels)
in_channels: Number of input channels (3 for RGB)
embed_dim: Embedding dimension
"""
super().__init__()
self.image_size = image_size
self.patch_size = patch_size
self.num_patches = (image_size // patch_size) ** 2
# Linear projection of flattened patches
self.projection = nn.Conv2d(
in_channels,
embed_dim,
kernel_size=patch_size,
stride=patch_size
)
def forward(self, x):
# x shape: (batch_size, channels, height, width)
# Output shape: (batch_size, num_patches, embed_dim)
x = self.projection(x) # (B, embed_dim, H/P, W/P)
x = x.flatten(2) # (B, embed_dim, num_patches)
x = x.transpose(1, 2) # (B, num_patches, embed_dim)
return x
The "16x16 Words" Concept
For a 224×224 image with 16×16 patches:
Image: 224 × 224 pixels
Patch size: 16 × 16 pixels
Number of patches: (224/16) × (224/16) = 14 × 14 = 196 patches
Each patch becomes a "word" in the sequence!
This is the origin of the title: "An Image is Worth 16×16 Words"
Complete ViT Architecture
class VisionTransformer(nn.Module):
def __init__(
self,
image_size=224,
patch_size=16,
num_classes=1000,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4.0
):
super().__init__()
# Patch embedding
self.patch_embed = PatchEmbedding(
image_size, patch_size, 3, embed_dim
)
num_patches = self.patch_embed.num_patches
# CLS token (for classification)
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
# Positional embeddings
self.pos_embed = nn.Parameter(
torch.zeros(1, num_patches + 1, embed_dim)
)
# Transformer encoder blocks
self.blocks = nn.ModuleList([
TransformerBlock(embed_dim, num_heads, mlp_ratio)
for _ in range(depth)
])
# Classification head
self.norm = nn.LayerNorm(embed_dim)
self.head = nn.Linear(embed_dim, num_classes)
def forward(self, x):
B = x.shape[0]
# Patch embedding
x = self.patch_embed(x) # (B, num_patches, embed_dim)
# Add CLS token
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat((cls_tokens, x), dim=1) # (B, num_patches+1, embed_dim)
# Add positional embedding
x = x + self.pos_embed
# Transformer blocks
for block in self.blocks:
x = block(x)
# Classification
x = self.norm(x)
cls_output = x[:, 0] # Extract CLS token
logits = self.head(cls_output)
return logits
Transformer Block
Each transformer block contains standard components:
class TransformerBlock(nn.Module):
def __init__(self, embed_dim, num_heads, mlp_ratio=4.0):
super().__init__()
# Multi-head self-attention
self.attention = MultiHeadAttention(embed_dim, num_heads)
self.norm1 = nn.LayerNorm(embed_dim)
# Feed-forward network
mlp_hidden_dim = int(embed_dim * mlp_ratio)
self.mlp = MLP(embed_dim, mlp_hidden_dim)
self.norm2 = nn.LayerNorm(embed_dim)
def forward(self, x):
# Attention with residual connection
x = x + self.attention(self.norm1(x))
# MLP with residual connection
x = x + self.mlp(self.norm2(x))
return x
Model Variants
ViT comes in multiple sizes:
ViT-Base
- Layers: 12
- Hidden size: 768
- MLP size: 3072
- Heads: 12
- Parameters: 86M
ViT-Large
- Layers: 24
- Hidden size: 1024
- MLP size: 4096
- Heads: 16
- Parameters: 307M
ViT-Huge
- Layers: 32
- Hidden size: 1280
- MLP size: 5120
- Heads: 16
- Parameters: 632M
Training Strategy
Pre-training on Large Datasets
ViT's success relies heavily on pre-training with large datasets:
# Pre-training datasets
JFT-300M: 300 million images (Google internal)
ImageNet-21k: 14 million images, 21k classes
ImageNet-1k: 1.3 million images, 1k classes
# Training procedure
1. Pre-train on large dataset (JFT-300M or ImageNet-21k)
2. Fine-tune on downstream tasks
3. Achieve state-of-the-art results
The Data Scale Effect
A key finding: ViT requires more data than CNNs to reach optimal performance
Small datasets (ImageNet-1k only):
- CNNs: Excellent performance
- ViT: Underperforms CNNs
Large datasets (ImageNet-21k, JFT-300M):
- ViT: Matches or exceeds CNN performance
- Scales better with data
This is because:
- CNNs have built-in inductive biases for images (locality, translation equivariance)
- ViT learns these patterns from data
- With sufficient data, ViT learns better representations
Experimental Results
ImageNet Classification
When pre-trained on JFT-300M and fine-tuned on ImageNet:
| Model | Accuracy | Pre-training Cost |
|---|---|---|
| BiT-L (ResNet) | 87.5% | High |
| Noisy Student | 88.4% | Very High |
| ViT-H/14 | 88.6% | Lower |
Transfer Learning Performance
ViT excels at transfer learning across multiple benchmarks:
Few-shot learning:
- Competitive or better than CNNs
- Scales efficiently with pre-training data
VTAB (Visual Task Adaptation Benchmark):
- Strong performance across diverse tasks
- Better generalization than CNN counterparts
Computational Efficiency
Training Efficiency
# Comparison for similar performance
ResNet-152 (BigTransfer):
- Parameters: ~60M
- Pre-training: 9k TPUv3 core-days
ViT-L/16:
- Parameters: 307M
- Pre-training: 2.5k TPUv3 core-days
- Result: Better accuracy with ~4x less computation!
Inference Efficiency
ViT offers excellent inference throughput:
Patches processed in parallel (unlike CNN's sequential convolutions)
Self-attention allows global context at every layer
Efficient on modern hardware (GPUs/TPUs)
Attention Visualizations
One of ViT's advantages is interpretability through attention maps:
def visualize_attention(model, image, layer_idx=11):
"""
Visualize attention patterns in ViT
Args:
model: Trained ViT model
image: Input image
layer_idx: Which transformer layer to visualize
"""
# Get attention weights from specified layer
with torch.no_grad():
attentions = model.get_attention_maps(image)
attention = attentions[layer_idx]
# Average over heads
attention = attention.mean(dim=1)
# Attention from CLS token to patches
cls_attention = attention[0, 0, 1:]
# Reshape to image grid
num_patches = int(cls_attention.shape[0] ** 0.5)
attention_map = cls_attention.reshape(num_patches, num_patches)
# Visualize
plot_attention_map(attention_map, image)
Key observations from attention patterns:
- Early layers: Local, texture-focused attention
- Middle layers: Object parts and structure
- Late layers: Semantic, whole-object attention
Hybrid Architectures
The paper also explored ViT-hybrid models:
class HybridVisionTransformer(nn.Module):
def __init__(self, cnn_backbone, transformer_config):
super().__init__()
# CNN feature extractor (e.g., ResNet50)
self.cnn_backbone = cnn_backbone
# Use CNN features as "patches"
self.transformer = VisionTransformer(
patch_size=1, # Already extracted features
**transformer_config
)
def forward(self, x):
# Extract CNN features
features = self.cnn_backbone(x)
# Process with transformer
output = self.transformer(features)
return output
Findings:
- Hybrids perform slightly better at smaller scales
- Pure ViT outperforms hybrids at larger scales
- Demonstrates that CNNs are not necessary
Position Embeddings
ViT uses learnable position embeddings:
# 1D position embeddings
self.pos_embed = nn.Parameter(
torch.zeros(1, num_patches + 1, embed_dim)
)
# During training, the model learns spatial relationships
# Despite being 1D, it learns 2D spatial structure!
Experiments showed:
- 1D embeddings work as well as 2D
- Model learns spatial relationships from data
- Pre-training learns transferable position patterns
Impact on Computer Vision
ViT's impact has been transformative:
1. Architectural Paradigm Shift
Demonstrated that domain-specific architectures (CNNs) are not always necessary:
- Enabled architecture unification across domains
- Paved the way for multi-modal models
2. Scaling Laws
Showed that vision models scale similarly to NLP models:
- Performance improves with model size
- Performance improves with data size
- Follows predictable scaling laws
3. Research Directions
Inspired numerous follow-up works:
- DeiT: Data-efficient image transformers
- Swin Transformer: Hierarchical vision transformers
- BEiT: BERT pre-training for images
- MAE: Masked autoencoders for vision
- CLIP: Contrastive language-image pre-training
4. Industry Adoption
Widely adopted in production systems:
- Image classification
- Object detection (DETR, ViT-based detectors)
- Semantic segmentation
- Multi-modal understanding
Best Practices
When to Use ViT
# Good fit:
- Large-scale datasets available
- Pre-trained models for transfer learning
- Need for interpretability (attention maps)
- Multi-modal applications
# Consider alternatives (CNNs):
- Small datasets without pre-training
- Extremely resource-constrained environments
- Real-time applications requiring minimal latency
Fine-tuning ViT
# Load pre-trained model
model = VisionTransformer.from_pretrained('vit_large_patch16_224')
# Adjust classification head
model.head = nn.Linear(model.embed_dim, num_classes)
# Fine-tune with higher resolution (optional)
model = model.resize_positional_embedding(384) # 224 -> 384
# Training
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
# Use cosine learning rate schedule
# Train for 20-100 epochs depending on dataset size
Limitations and Considerations
1. Data Requirements
- Requires large pre-training datasets
- May underperform CNNs on small datasets
- Pre-training is computationally expensive
2. Fixed Resolution
- Position embeddings are resolution-specific
- Requires interpolation for different resolutions
3. Quadratic Complexity
- Self-attention is O(n²) in sequence length
- Can be limiting for very high-resolution images
Future Directions
ViT opened several research avenues:
Efficient Transformers
Reducing computational complexity:
- Sparse attention patterns
- Linear attention mechanisms
- Hierarchical structures
Self-Supervised Learning
Better pre-training objectives:
- Masked image modeling
- Contrastive learning
- Multi-modal pre-training
Architecture Improvements
Enhancing the basic design:
- Adaptive patch sizes
- Dynamic depth/width
- Hybrid approaches
Conclusion
Vision Transformer represents a landmark achievement in computer vision. By demonstrating that pure transformer architectures can match or exceed CNN performance on image recognition tasks, ViT fundamentally challenged assumptions about what architectures are necessary for visual understanding.
Key Contributions:
- Architectural simplicity: Pure transformer, no convolutions needed
- Scalability: Excellent performance scaling with data and model size
- Efficiency: Superior computational efficiency compared to CNNs at scale
- Transferability: Strong transfer learning across diverse visual tasks
- Interpretability: Attention maps provide insights into model behavior
The impact of ViT extends beyond image classification, influencing object detection, segmentation, video understanding, and multi-modal learning. It established transformers as a universal architecture capable of handling diverse data modalities, paving the way for foundation models that can process text, images, audio, and more within a unified framework.
As the field continues to evolve, ViT's principles remain foundational to modern computer vision, and its legacy continues through countless derivative works and applications in both research and industry.
Citation:
@article{dosovitskiy2020image,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and others},
journal={arXiv preprint arXiv:2010.11929},
year={2020}
}